Improving Bilingual Projections via Sparse Covariance Matrices
Authors
Abstract
Mapping documents into an interlingual representation can help bridge the language barrier in cross-lingual corpora. Many existing approaches are based on word co-occurrences extracted from aligned training data and represented as a covariance matrix. In theory, such a covariance matrix should represent semantic equivalence and should be highly sparse. Unfortunately, noise leads to dense covariance matrices, which in turn lead to suboptimal document representations. In this paper, we explore techniques to recover the desired sparsity in covariance matrices in two ways. First, we explore word association measures and bilingual dictionaries to weight the word pairs. Second, we explore different selection strategies to remove the noisy pairs based on the association scores. Our experimental results on the task of aligning comparable documents show the efficacy of sparse covariance matrices on two data sets from two different language pairs.
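The two-step idea in the abstract (weight the word pairs, then select among them) can be illustrated with a minimal sketch. It is not the paper's method: pointwise mutual information (PMI) stands in for the unspecified word association measure, keeping the top k positive scores per source word stands in for the unspecified selection strategy, and the function names and toy data are hypothetical.

```python
import numpy as np


def pmi_weights(X, Y):
    """Score every source/target word pair with PMI (an assumed measure).

    X: (n_docs, vocab_src) binary term-presence matrix.
    Y: (n_docs, vocab_tgt) binary term-presence matrix.
    Row d of X and row d of Y describe an aligned document pair.
    """
    n = X.shape[0]
    joint = (X.T @ Y) / n                  # p(i, j): pair co-occurrence rate
    p_src = X.sum(axis=0) / n              # p(i): source word rate
    p_tgt = Y.sum(axis=0) / n              # p(j): target word rate
    expected = np.outer(p_src, p_tgt)      # rate expected under independence
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(joint / expected)
    pmi[~np.isfinite(pmi)] = 0.0           # pairs that never co-occur get 0
    return pmi


def sparsify_top_k(scores, k):
    """Keep the k highest positive scores in each row; zero everything else."""
    sparse = np.zeros_like(scores)
    top = np.argsort(-scores, axis=1)[:, :k]
    rows = np.arange(scores.shape[0])[:, None]
    sparse[rows, top] = scores[rows, top]
    sparse[sparse < 0] = 0.0               # drop negatively associated pairs
    return sparse


# Toy aligned corpus (hypothetical): 4 document pairs, 3 words per language.
X = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 0]], dtype=float)
Y = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 1]], dtype=float)

dense = pmi_weights(X, Y)
S = sparsify_top_k(dense, k=1)
print(S)
```

On this toy input the dense score matrix has an entry for every word pair, while the sparsified matrix keeps at most one candidate translation per source word. A bilingual dictionary could be used in place of, or combined with, the PMI scores before the selection step.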
Similar resources
Estimation of the sample covariance matrix from compressive measurements
This paper focuses on the estimation of the sample covariance matrix from low-dimensional random projections of data known as compressive measurements. In particular, we present an unbiased estimator to extract the covariance structure from compressive measurements obtained by a general class of random projection matrices consisting of i.i.d. zero-mean entries and finite first four moments. In ...
Factored sparse inverse covariance matrices
Most HMM-based speech recognition systems use Gaussian mixtures as observation probability density functions. An important goal in all such systems is to improve parsimony. One method is to adjust the type of covariance matrices used. In this work, factored sparse inverse covariance matrices are introduced. Based on U'DU factorization, the inverse covariance matrix can be represented using line...
More powerful tests for sparse high-dimensional covariances matrices
This paper considers improving the power of tests for the identity and sphericity hypotheses regarding high dimensional covariance matrices. The power improvement is achieved by employing the banding estimator for the covariance matrices, which leads to significant reduction in the variance of the test statistics in high dimension. Theoretical justification and simulation experiments are provid...
A Well-Conditioned and Sparse Estimation of Covariance and Inverse Covariance Matrices Using a Joint Penalty
We develop a method for estimating well-conditioned and sparse covariance and inverse covariance matrices from a sample of vectors drawn from a sub-Gaussian distribution in a high-dimensional setting. The proposed estimators are obtained by minimizing a quadratic loss function plus a joint penalty of the l1 norm and the variance of its eigenvalues. In contrast to some of the existing methods of covariance...
l0 Sparse Inverse Covariance Estimation
Recently, there has been a focus on penalized log-likelihood covariance estimation for sparse inverse covariance (precision) matrices. The penalty is responsible for inducing sparsity, and a very common choice is the convex l1 norm. However, the best estimator performance is not always achieved with this penalty. The most natural sparsity promoting “norm” is the non-convex l0 penalty but its lack ...